The Internet has become essential to all aspects of modern life, and thus the consequences
of network disruption have become increasingly severe. It is widely recognised that the 
Internet is not sufficiently resilient, survivable, and dependable, and that significant 
research, development, and engineering is necessary to improve the situation. This paper 
provides an architectural framework for resilience and survivability in communication net- 
works and provides a survey of the disciplines that resilience encompasses, along with sig- 
nificant past failures of the network infrastructure. A resilience strategy is presented to 
defend against, detect, and remediate challenges, a set of principles for designing resilient 
networks is presented, and techniques are described to analyse network resilience. 
1. Introduction and motivation


Networks in general, and the Global Internet in particu- 
lar, have become essential for the routine operation of busi- 
nesses and to the global economy. Consumers use the 
Internet to access information, obtain products and ser- 
vices, manage finances, and communicate with one an- 
other. Businesses use the Internet to transact commerce 
with consumers and other businesses. Governments depend
on networks for their daily operation, service delivery,
and response to disasters. The military depends on the Glo- 
bal Information Grid [1] to execute network centric opera- 
tions and warfare. The Global Internet may thus be 
described as one of the critical infrastructures on which
our lives and prosperity depend, along with the transportation
infrastructure and the power generation and distribution grid [2].
Furthermore, many of these infrastructures have depen- 
dencies on one another [3]. The canonical example is that 
the Internet depends on the electrical grid for power, while 
the electrical grid increasingly depends on the Internet for 
SCADA (supervisory control and data acquisition) [4]. 
However, the increased dependence on, and sophistication
of, services makes the Internet more vulnerable to problems.
With a continuously increasing reliance come two
related consequences: First, this reliance results in increas- 
ing consequences of disruption. Second, the increased con- 
sequences of disruption lead to networks becoming a more 
attractive target for cyber-criminals. We believe that resil- 
ience must be viewed as an essential design and opera- 
tional characteristic of future networks in general, and 
the Global Internet in particular. The vulnerabilities of 
the current Internet and the need for greater resilience 
are widely recognised [4-8]. In our work, and in this paper, 
we define resilience as the ability of the network to provide 
and maintain an acceptable level of service in the face of var- 
ious faults and challenges to normal operation. 

This paper provides a broad overview of the discipline 
of resilience and presents a systematic architectural 
framework based on several major research initiatives, 
and is organised as follows: Section 2 introduces the scien- 
tific disciplines that form the basis for resilience and pre- 
sents them systematically. Section 3 describes a selection 
of significant past challenges and failures of the network 
infrastructure in the context of threats to the network. 
Section 4 presents strategies for achieving network resil- 
ience and survivability. Next, Section 5 describes a set of 
design principles to achieve resilience based on the strat- 
egy as well as from experience in the resilience disciplines. 
Section 6 describes the need for analysis of resilience along 
with a state space formulation and example application of 
a large-scale disaster. Finally, Section 7 summarises the 
main points of the paper and suggests further directions 
of research. 


2. Resilience disciplines 


There are a number of relevant disciplines that serve as
the basis of network resilience, and which our broad
definition of resilience subsumes. Because these disciplines
have developed independently over a number of decades, 
there is no established self-consistent schema and termi- 
nology. This section will introduce these disciplines and 
describe their organisation within the resilience domain, 
after introducing the important concept of the fault →
error → failure chain.


2.1. Fault → error → failure chain


A fault is a flaw in the system that can cause an error 
[9,10]. This can either be an accidental design flaw (such 
as a software bug), or an intentional flaw due to con- 
straints that permit an external challenge to cause an er- 
ror, such as not designing a sufficiently rugged system 
due to cost constraints. A dormant fault may be trig- 
gered, leading to an active fault, which may be observa- 
ble as an error. An error is a deviation between an 
observed value or state and its specified correct value 
or state [10-12] that may lead to a subsequent service 
failure [9]. A service failure (frequently shortened to failure)
is a deviation of service from the desired system functioning
such that it does not meet its specification or expectation
[9,13,11,12,14]. Thus a fault may be triggered to
cause an observable error, which may result in a failure 
if the error is manifest in a way that causes the system 
not to meet its service specification. This relationship is 
shown in Fig. 1; the boxes labelled defend and detect 
are part of the resilience strategy that will be explained 
in Section 4. For now, note that network defences may 
prevent challenges from triggering a fault and that many 
observable errors do not result in a failure. Disruption 
tolerance (Section 2.2.3) is one example of reducing the 
impacts of faults and errors on service delivery. Furthermore,
challenges and errors can be detected, which also
provides a basis for actions taken as part of a resilience 
strategy. 
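
The chain described above can be sketched as a small state model. This is an illustrative sketch only: the `Stage` names and the `advance` function are hypothetical and are not taken from the cited definitions. Each transition is conditional, which reflects the observation that not every fault is triggered and not every error results in a failure.

```python
from enum import Enum


class Stage(Enum):
    DORMANT_FAULT = 1
    ACTIVE_FAULT = 2
    ERROR = 3
    FAILURE = 4


def advance(stage, triggered=False, observable=False, violates_spec=False):
    """Return the next stage of the fault -> error -> failure chain.

    A dormant fault must be triggered to become active, an active fault
    must be observable to count as an error, and an error must violate
    the service specification to count as a service failure. If the
    relevant condition does not hold, the stage is unchanged.
    """
    if stage is Stage.DORMANT_FAULT and triggered:
        return Stage.ACTIVE_FAULT
    if stage is Stage.ACTIVE_FAULT and observable:
        return Stage.ERROR
    if stage is Stage.ERROR and violates_spec:
        return Stage.FAILURE
    return stage
```

In this sketch, a defence corresponds to keeping `triggered` false, while an error that never violates the specification remains an error and does not become a failure.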
Fig. 1. Fault → error → failure chain.


J.P.G. Sterbenz et al./Computer Networks 54 (2010) 1245-1265 1247 






Fig. 2. Resilience disciplines. [Figure not reproduced. Left: challenge
tolerance, comprising survivability (many or targeted failures), fault
tolerance (few and random failures), disruption tolerance (environmental),
and traffic tolerance (legitimate flash crowd; attack DDoS). Centre:
robustness and complexity. Right: trustworthiness, comprising dependability
(reliability, maintainability, availability), security (confidentiality,
nonrepudiability, and AAA: auditability, authorisability, authenticity),
and performability (QoS measures).]


2.2. Disciplines relating to challenge tolerance 


At the highest level, we divide the disciplines into two 
categories, as shown in Fig. 2. On the left side are challenge 
tolerance disciplines that deal with the design and engi- 
neering of systems that continue to provide service in the 
face of challenges. On the right side are trustworthiness
disciplines that describe measurable properties of resilient 
systems. The relationship between these two is robustness, 
which formally is the performance of a control system 
when perturbed, or in our context, the trustworthiness of 
a system when challenged. Note that a comprehensive sur- 
vey of these fields would occupy far more space than is 
available in this paper. 

The first major subset of resilience disciplines deals 
with the problem of how to design systems to tolerate 
the challenges that prevent the desired service delivery. 
These challenges can be subdivided into (1) component
and system failures, with which fault tolerance and survivability
are concerned, (2) disruptions of communication paths,
with which disruption tolerance is concerned, and (3) challenges
due to the injection of traffic into the network, with
which traffic tolerance is concerned.


2.2.1. Fault tolerance 

Fault tolerance is one of the oldest resilience disci- 
plines, and is defined as the ability of a system to tolerate 
faults such that service failures do not result [13,11,12]. 
While the use of redundancy to cover for failures in phys- 
ical systems dates back centuries, perhaps the first refer- 
ence in a computing context was the introduction of N- 
version programming in the context of the Babbage dif- 
ference engine [15] in the mid 1800s. Fault tolerance 
emerged as a modern discipline in the 1950s when von 
Neumann and Shannon devised techniques to design reli- 
able telephone switching systems out of relatively unreli- 
able mechanical relays [16]. Fault tolerance was also 
applied to computer system design in the 1960s, particularly
for mission critical systems used in defence and
aerospace [17,18]. 

Fault tolerance relies on redundancy as a technique to 
compensate for the random uncorrelated failure of compo- 
nents. Fault tolerance techniques can be applied to both 
hardware, such as triple-modular redundancy [19], and 
to software, such as N-version programming [20] and 
recovery blocks [21], and are generally sufficient when 
applied to systems of limited geographic scope. Fault toler- 
ance is not sufficient to provide coverage in the face of cor- 
related failures, and therefore is necessary but not 
sufficient to provide resilience. Thus, fault tolerance can 
be considered a subset of survivability, which considers 
multiple correlated failures, as described in the next 
section. 
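
As a concrete illustration of redundancy masking a single random fault, triple-modular redundancy can be sketched as a majority voter. This is a minimal sketch: the function names are hypothetical, the voter itself is assumed to be perfect, and the reliability formula assumes independent (uncorrelated) module failures, which is exactly the assumption that breaks down under correlated failures.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority vote over three redundant module outputs.

    Returns the value produced by at least two modules, which masks a
    single faulty module. Raises ValueError if all three disagree.
    Outputs must be hashable.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three modules disagree")
    return value


def tmr_reliability(r):
    """Reliability of a TMR system built from modules of reliability r.

    Assumes independent module failures and a perfect voter: the system
    works if at least two of the three modules work, that is
    3*r^2*(1 - r) + r^3 = 3*r^2 - 2*r^3.
    """
    return 3 * r**2 - 2 * r**3
```

With modules of reliability 0.9, `tmr_reliability(0.9)` evaluates to approximately 0.972, so the redundant system is more reliable than any single module, provided that failures really are uncorrelated.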

It is important to note that the optical networking com- 
munity uses the term survivability to mean link- and node- 
level fault tolerance. Techniques such as SONET/SDH auto- 
matic protection switching [22] and p-cycles [23] are fault 
tolerance techniques applied to a network graph. Note that 
shared risk link groups (SRLGs) [24] provide topological
diversity, but not necessarily geographic diversity. 


2.2.2. Survivability 

The emergence of and dependence on the Internet led
to the realisation that new techniques were needed for un- 
bounded networks that could be affected by correlated fail- 
ures for which fault-tolerant design techniques are not 
sufficient. Survivability is the capability of a system to ful- 
fill its mission, in a timely manner, in the presence of 
threats such as attacks or large-scale natural disasters. This 
definition captures the aspect of correlated failures due to 
an attack by an intelligent adversary [25,26], as well as fail- 
ures of large parts of the network infrastructure [27,28]. 

In addition to the redundancy required by fault toler- 
ance, survivability requires diversity so that the same fate 
is unlikely to be shared by parts of the system undergoing 
correlated failures. 
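
The difference between redundancy and diversity can be sketched with a simple shared-fate check. This is an illustrative sketch under stated assumptions: replicas are represented as plain dictionaries, and the attribute names and values (such as `building`) are hypothetical.

```python
def shares_fate(replicas, attribute):
    """True if every replica has the same value for `attribute`."""
    return len({replica[attribute] for replica in replicas}) == 1


def survives(replicas, attribute, failed_value):
    """True if at least one replica lies outside the failed domain."""
    return any(replica[attribute] != failed_value for replica in replicas)


# Hypothetical deployments: both are redundant, only one is diverse.
co_located = [
    {"name": "primary", "building": "X"},
    {"name": "backup", "building": "X"},
]
diverse = [
    {"name": "primary", "building": "X"},
    {"name": "backup", "building": "Y"},
]
```

Here `survives(co_located, "building", "X")` is false, because a single event affecting building X removes both replicas, whereas `survives(diverse, "building", "X")` is true.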
While survivability is significantly more difficult to
quantify than fault tolerance, it has been formalised as a
set-theoretic and state-machine based formulation [29]:


Survivability = {S, E, D, V, T, P}


where S is the set of acceptable service specifications, E describes
the ways in which the system can degrade based on
external challenges, D are the practical values of E, V is the
relative ordering of service values $S \times D$, $T \subseteq S \times S \times D$ is
the set of valid transitions between service states S given
a challenge D, and P are the service probabilities that some
$s \in S$ must meet dependability requirements.
Survivability has also been quantified using multi- 
dimensional Markov chains to consider simultaneous fail- 
ures [30], in which one dimension captures the failure of 
components and the other dimension the performability. 


2.2.3. Disruption tolerance 

Another major type of challenge that is unique to 
communication networks comes from challenges in the 
communication environment that make it difficult to 
maintain stable end-to-end connections between users. 
Disruption tolerance is the ability of a system to tolerate 
disruptions in connectivity among its components, consist- 
ing of the environmental challenges: weak and episodic 
channel connectivity, mobility, unpredictably-long delay, 
as well as tolerance of energy (or power) challenges. 

There are three major contributors to the field of dis- 
ruption tolerance. The first is motivated by dynamic net- 
work behaviour and began with wireless packet radio 
networks [31]. This led to further research in mobile ad 
hoc networks (MANETs) that have proposed forwarding 
and routing mechanisms for dynamic networks in which 
the connectivity among members is continually changing 
[32,33]. The second contributor was motivated by very 
large delays that traditional network protocols cannot tolerate,
specifically the satellite and space environment [34]
and the Interplanetary Internet (IPN) [35]. This research 
led to more general notions of delay-tolerant networking 
[36] and disruption-tolerant networking in which stable 
end-to-end paths may never exist. Techniques that support 
such networks include communicating as far as possible 
but reverting to store-and-forward when necessary, and 
mobile nodes carrying information, called store-and-haul 
[26], store-carry-forward [37], or ferrying [38]. The third 
contributor is energy-constrained networks, exemplified 
by wireless sensor networks [39], in which nodes that have 
drained their battery can no longer contribute to network 
connectivity [40,41]. 
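
The store-carry-forward idea mentioned above can be sketched as follows. This is a deliberately naive illustration: the `DtnNode` class and its methods are hypothetical, and a node simply hands all buffered messages to whichever neighbour it meets, without any routing decision, custody transfer, or buffer limit.

```python
from collections import deque


class DtnNode:
    """Hypothetical node that buffers messages until a contact occurs."""

    def __init__(self, name):
        self.name = name
        self.buffer = deque()   # messages held while no link is available
        self.delivered = []     # payloads addressed to this node

    def receive(self, destination, payload):
        """Deliver locally if addressed to this node, otherwise store."""
        if destination == self.name:
            self.delivered.append(payload)
        else:
            self.buffer.append((destination, payload))

    def contact(self, neighbour):
        """Forward every buffered message to a neighbour during a contact."""
        while self.buffer:
            destination, payload = self.buffer.popleft()
            neighbour.receive(destination, payload)


# A mobile ferry carries a message between two nodes that never meet.
source, ferry, sink = DtnNode("A"), DtnNode("ferry"), DtnNode("B")
source.receive("B", "hello")   # stored: no path to B exists yet
source.contact(ferry)          # ferry picks the message up
ferry.contact(sink)            # ferry later meets B and delivers it
```

No end-to-end path between A and B exists at any instant in this example, yet the message is delivered because an intermediate node stores and physically carries it.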

More recently, disruption-tolerant networking techniques
have found application in a number of domain-specific
scenarios, including vehicular ad hoc networks (VANETs)
[42-44], weather disruption-tolerant networks [45], and 
highly-dynamic airborne networks [46]. 


2.2.4. Traffic tolerance

The last major challenge category is that caused by the 
injection of traffic into the network. Traffic tolerance is the 
ability of a system to tolerate unpredictable offered load 
without a significant drop in carried load (including congestion
collapse), as well as to isolate the effects from cross
traffic, other flows, and other nodes. In defining traffic as a
challenge, we mean traffic beyond the design parameters 
of the network in its normal operation (Section 4). Traffic 
challenges can either be unexpected but legitimate such 
as from a flash crowd [47], or malicious such as from a dis- 
tributed denial-of-service (DDoS) attack [48]. It is impor- 
tant to note that while DDoS detection is an important 
endeavour, network resources are impacted regardless of 
whether traffic is malicious or not. Furthermore, a suffi- 
ciently sophisticated DDoS attack is indistinguishable from 
normal traffic, and thus traffic tolerance mechanisms are 
important whether or not attack detection mechanisms 
are successful. 


2.3. Disciplines relating to trustworthiness 


Trustworthiness is defined as the assurance that a sys- 
tem will perform as expected [49], which must be with 
respect to measurable properties. The trustworthiness dis- 
ciplines therefore measure service delivery of a network, 
and consist of (1) dependability, (2) security, and (3) 
performability. 


2.3.1. Dependability 

Dependability is the discipline that quantifies the reli- 
ance that can be placed on the service delivered by a sys- 
tem [50,9], and consists of two major aspects: availability 
and reliability. Important to both of these aspects are the 
expected values of the failure and repair density functions. 
The basic measures of dependability are the MTTF (mean
time to failure), which is the expected value of the failure
density function, and the MTTR (mean time to repair), which
is the expected value of the repair density function. The mean
time between failures (MTBF) is the sum of these two [51]:


MTBF = MTTF + MTTR 


Availability is readiness for usage, which is the probability 
that a system or service will be operable when needed, 
and is calculated as 


A = MTTF/MTBF 


Reliability is continuity of service, that is the probability that 
a system or service remains operable for a specified period 
of time: 


R(t) = Pr[no failure in (0, t]] = 1 - Q(t)


where Q(t) is the failure cumulative distribution function. 
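
These measures can be computed directly, as in the following minimal sketch. The function names are hypothetical. The definitions of MTBF and availability follow the formulas above, whereas the closed form used for R(t) assumes an exponentially distributed time to failure, so that Q(t) = 1 - exp(-t/MTTF); that distribution is an assumption made here for illustration and is not implied by the definitions.

```python
import math


def mtbf(mttf, mttr):
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf + mttr


def availability(mttf, mttr):
    """Availability: A = MTTF / MTBF."""
    return mttf / mtbf(mttf, mttr)


def reliability(t, mttf):
    """R(t) = 1 - Q(t) under an ASSUMED exponential failure model.

    With exponentially distributed time to failure,
    Q(t) = 1 - exp(-t / MTTF), hence R(t) = exp(-t / MTTF).
    t and mttf must be expressed in the same time unit.
    """
    return math.exp(-t / mttf)
```

For example, `availability(999, 1)` gives 0.999 for any time unit, as long as MTTF and MTTR are expressed in the same unit.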

These notions of dependable systems have been codi- 
fied by IFIP WG 10.4 [14] and ANSI T1A1 [52] and are com- 
monly applied to network dependability. These notions of 
reliability are also applied to fibre-optic links as a measure 
of fault tolerance [53,54]. 

The relative importance of availability and reliability 
depends on the application service. Availability is of primary
importance for transactional services such as HTTP-based
Web browsing: as long as the server is usually up,
it matters less if it fails frequently as long as the MTTR is 
very short. On the other hand, reliability is of prime impor- 
tance for session- and connection-oriented services such as 
teleconferencing: to be useful, the session must remain up 
for a specified period of time requiring a long MTTF. 
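
As a worked illustration of this distinction, consider two hypothetical services with identical availability but very different failure behaviour. The figures are invented for illustration, and the reliability values assume an exponential failure model, R(t) = exp(-t/MTTF), which is an assumption rather than part of the definitions.

```latex
% Service 1 (hypothetical): MTTF = 99 s, MTTR = 1 s
A_1 = \frac{99}{99 + 1} = 0.99, \qquad
R_1(1\,\mathrm{h}) = e^{-3600/99} < 10^{-15}

% Service 2 (hypothetical): MTTF = 99 h, MTTR = 1 h
A_2 = \frac{99}{99 + 1} = 0.99, \qquad
R_2(1\,\mathrm{h}) = e^{-1/99} \approx 0.990
```

Both services are available 99% of the time, which may be adequate for transactional browsing, but only the second is likely to sustain an hour-long session without interruption.
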
Additionally, there are several other aspects of dependability
[9]: Maintainability is the aptitude to undergo repairs
and evolutions. Safety is dependability with respect
to catastrophic failures [55]. Safety is of particular concern 
for critical infrastructures such as the power grid and 
nuclear power plants, and their interdependence on the 
Internet and SCADA (supervisory control and data acquisi- 
tion) networks. Integrity is an aspect of dependability that 
is more commonly associated with security, which is de- 
scribed next. 


2.3.2. Security 

Security is the property of a system, and the measures ta- 
ken such that it protects itself from unauthorised access or 
change, subject to policy [56]. Security properties include 
AAA (authenticity, authorisability, auditability), confidential- 
ity, and nonrepudiability. Security shares with dependability 
the properties of availability and integrity [14]. In the con- 
text of trustworthiness, we are concerned with the measur- 
able properties of the security aspects [57,58]. We can also 
consider security to be related to a level of self-protection 
[59], which is an enabling principle for resilience. 


2.3.3. Performability 

Performability [60] is the property of a system such that 
it delivers performance required by the service specifica- 
tion, as described by QoS (quality of service) measures 
such as delay, throughput or goodput, and packet delivery
ratio [61-66]. Performability extends the binary depend- 
ability concepts to the range of degradable systems. 


2.4. Robustness and complexity


Two disciplines lie outside challenge tolerance and 
trustworthiness, but describe their relationship to one an- 
other (robustness) and overall characteristics (complexity). 


2.4.1. Robustness 

Robustness is a control-theoretic property that relates 
the operation of a system to perturbations of its inputs 
[67,68]. In the context of resilience, robustness describes 
the trustworthiness (quantifiable behaviour) of a system 
in the face of challenges that change its behaviour. Note 
that the term robustness is frequently used in a much less 
precise manner that is synonymous with resilience, surviv- 
ability, or security. 


2.4.2. Complexity 

Complexity refers to the ways in which large numbers of 
systems interact, resulting in emergent behaviour [69]. 
Complexity science has an important relationship to resil- 
ience and the robustness of systems, because resilience 
mechanisms such as self-organisation and autonomic 
behaviour increase complexity, and increased complexity 
may result in greater network vulnerability. 


3. Challenges and past failures 


This section describes the challenges to the Global 
Internet and interdependent networks such as the PSTN
that motivate the need for resilience. Interwoven among
challenge types is a selected set of past events and failures 
that have seriously affected these networks. Challenges are 
any characteristic or condition that impacts the normal 
operation of the network, consisting of unintentional 
mis-configuration or operational mistakes, large-scale nat- 
ural or human-caused disasters, malicious attacks from 
intelligent adversaries, environmental challenges (mobil- 
ity, weak channels, unpredictably long delay, constrained 
energy), unusual but legitimate traffic load, and service 
failures at a lower level. 


3.1. Unusual but legitimate traffic load 


A non-malicious request for service that places a greater
demand on the network (or a demand that differs along some other
dimension) than normal operation has been engineered to cope
with is a challenge to the network. This is commonly caused by a flash crowd
[47] when an event triggers a large volume of service re- 
quests beyond the normal load. In addition to affecting 
the target of the flash crowd, the network as a whole can 
be affected, particularly by cross traffic near the target 
[70]. A secondary effect of many of the challenges listed 
in the following subsections is a flash crowd due to emer- 
gency response and the population trying to obtain news 
about what has happened. 


3.2. Accidents and human mistakes 


Accidents and mistakes are non-malicious, made by 
people who interact with the system, such as unintentional 
device mis-configuration or not following correct policy. 
These may occur during system design or operation and 
can become more pernicious if the parties involved try to 
cover up their mistakes. Sometimes accidents and mis- 
takes can have very significant consequences, as described 
in the following paragraphs. 

A human error caused Google to flag every search result 
with the message “This site may harm your computer”. 
Usually only sites that are suspected of installing malicious 
software are kept in the database. On 01 February 2009 the 
‘/’ string was mistakenly inserted into this list, which ex- 
panded to all URLs and caused the erroneous flagging [71]. 

A large-scale blackout affected much of the Northeast- 
ern United States and the province of Ontario on 14 August 
2003, affecting 50 million people. Many interrelated fac- 
tors came into play in turning a few operational problems 
into a major failure. Three power-plant outages on a hot 
summer day resulted in insufficient reactive reserves being 
available. At about the same time, automatic alarm soft- 
ware processes were left turned off due to human error. 
The result was that when three high-voltage power-lines 
went down due to insufficient tree-trimming practices 
the load was not properly rerouted and instead caused a 
large portion of the power grid (15 more high-voltage 
lines) in northern Ohio to collapse. This caused surges 
and cascading failures in neighbouring connected grids, 
spreading faster than automatic protection relays were 
able to trip. Eventually enough failsafes and line-outages 
occurred, thus isolating the less robust portions of the grid 
and stopping the cascading failures, at the same time 
partitioning the grid into various islands. Full service was
not restored in some areas for over a week. The costs, both 
in terms of repairs and lost productivity, were on the order 
of 10 billion dollars [72]. This large-scale power failure had 
a significant impact on the interrelated Internet infrastruc- 
ture, when over 2000 globally advertised prefixes had se- 
vere outages of 2 hours or longer, affecting nearly 50% of 
all Internet autonomous systems [73]. This is the canonical 
example of infrastructure interdependence, with the Internet
relying on the power grid for equipment to stay operational,
while at the same time many SCADA power-control
systems are communicating using Internet-based
services. A lack of understanding of the complexity of the 
grid as a whole, as well as a number of human mistakes 
both in planning and in remediation decisions contributed 
to the extent and severity of this blackout. 

On 8 May 1988 a fire at the Illinois Bell switching office 
in Hinsdale caused severe damage to phone services in 
Northern Illinois in the US. The outage affected local, 
long-distance, mobile telephone, 800 service, and air-traf- 
fic control communication between Midway and O’Hare 
Airports in Chicago and the FAA Center in Aurora, Illinois. 
It took until the end of May to restore service. Although 
the PSTN (public switched telephone network) contained
hardware and link redundancy, services failed because 
both the primary and the backup system were located in 
the same building, and were both destroyed in the fire 
resulting from a lightning strike. The Hinsdale fire is a 
canonical example of how fault tolerance alone is not suf- 
ficient for resilience. A significant fraction of PSTN failures 
are due to accidents and human mistakes [74]. 

On 18 July 2001 a train derailed in the Howard Street 
Tunnel in Baltimore, Maryland in the Northeast US. The 
subsequent fire caused fibre backbone disruptions experi- 
enced by seven major ISPs [75,76]. This tunnel was a con- 
venient place to route fibre conduits under the city of 
Baltimore. Most traffic was rerouted, but this resulted in congestion
on alternative links, causing a noticeable slowdown
for a significant portion of the Internet. New fibre
strands were laid to restore physical capacity within 36 
hours. In the case of both the Baltimore tunnel and Hins- 
dale fires, design choices had been made that resulted in 
redundant systems being deployed; however without geo- 
graphic diversity the redundant infrastructure shared the 
same fate. 

While Pakistan Telecom was attempting to comply with 
a Government mandate to block a particular YouTube vi- 
deo on 24 February 2008, they advertised their own AS 
as the shortest path to a portion of YouTube’s IP-address 
space. This advertisement went out not only to down- 
stream providers within Pakistan, but also to their up- 
stream provider, Pacific Century CyberWorks (PCCW). At 
that time PCCW was not filtering for bogus prefix adver- 
tisements such as this, and it propagated the advertise- 
ment to the rest of the world, causing most HTTP 
requests for YouTube world-wide to be directed to Paki- 
stan Telecom. Over the next couple of hours, several at- 
tempts were made by YouTube to compete with the 
bogus route advertisements using more specific prefixes, 
but the situation did not return to normal until PCCW disconnected
Pakistan Telecom entirely. While the global
scope of this hijacking was most likely accidental, it clearly
demonstrates the vulnerability of BGP to route-spoofing 
and still presents a major challenge to resilient Internet 
operation [77-79]. 
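
The role of more specific prefixes in such an incident follows from longest-prefix-match forwarding: a router prefers the most specific prefix that covers a destination. The following minimal sketch illustrates this, assuming that the bogus announcement is itself more specific than the legitimate one. The prefixes, origin labels, and the `best_route` function are hypothetical and do not reproduce the actual announcements.

```python
import ipaddress


def best_route(destination, routes):
    """Return the origin of the longest prefix that covers `destination`.

    `routes` maps prefix strings to an origin label. Returns None if no
    prefix covers the destination.
    """
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), origin)
        for prefix, origin in routes.items()
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical prefixes drawn from private address space.
routes = {"10.0.0.0/22": "legitimate"}
routes["10.0.1.0/24"] = "hijacker"      # more specific: attracts traffic
during = best_route("10.0.1.10", routes)
routes["10.0.1.0/25"] = "legitimate"    # still more specific counter-announcement
after = best_route("10.0.1.10", routes)
```

In this sketch `during` is `"hijacker"` and `after` is `"legitimate"`. An address such as 10.0.1.200 lies outside the /25 and is still drawn to the hijacker, which suggests why competing with more specific prefixes is at best a partial remedy.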

In 2009 a small Czech provider (AS number 47868) 
using a MikroTik router caused severe problems to the 
BGP infrastructure [80,81]. An administrator mis-config- 
ured the BGP settings, which triggered a bug in Cisco’s 
BGP implementation. The effect of this problem was ob- 
served world-wide. The administrator’s intention was to 
prepend his own AS number once in order to increase 
the length of the BGP path via this AS. This is a common 
technique that operators use in order to avoid attracting 
traffic from peer providers. Unfortunately the administra- 
tor assumed that the configuration syntax followed Cisco’s 
IOS syntax, used in the dominant router platform. The syn- 
tax of a MikroTik router does not take the AS number that 
should be prepended as an argument, but rather as an inte- 
ger indicating how often the AS number should be pre- 
pended, which led to 47868 mod 256=252 times 
prepending the AS number. Such very long paths should 
be filtered by the receiving routers, but in practice only a few
routers were configured in this way. A Cisco router receiv- 
ing an update message which contains a route with 255 
ASes would reset itself while processing this message 
due to an interoperability bug in IOS. This combination of 
mis-configuration of a rarely used router, in conjunction 
with poorly configured routers accepting excessively long 
AS paths and the implementation bug in Cisco’s IOS, led 
to a ten-fold increase in global routing instability for about 
an hour. 
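
The arithmetic behind the excessive prepending can be reproduced in a few lines. This is an illustrative sketch: the function names are hypothetical, the 8-bit field width is an assumption consistent with the "mod 256" behaviour described above, and the starting path containing only the origin AS is invented for the example.

```python
def prepend_count(argument, field_bits=8):
    """Prepend count if `argument` is truncated to an unsigned field.

    The 8-bit default is an assumption consistent with the 'mod 256'
    behaviour described in the text.
    """
    return argument % (1 << field_bits)


def prepend(as_path, asn, count):
    """Return a new AS path with `asn` prepended `count` times."""
    return [asn] * count + list(as_path)


asn = 47868
count = prepend_count(asn)            # AS number misread as a count
path = prepend([asn], asn, count)     # hypothetical one-hop starting path
```

Here `count` is 252 and the resulting `path` contains 253 entries, whereas the intended configuration would have prepended the AS number only once.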


3.3. Large-scale disasters 


Large-scale disasters may result from either natural 
causes or from human mistakes. In either case they are a 
unique category of challenges because they result in corre- 
lated failure over a large area, such as destroying hardware 
while simultaneously preventing operators from perform- 
ing normal functions and restricting information access 
by decision-makers, resulting in poor remediation choices. 
Examples of large-scale disasters include hurricanes, 
earthquakes, ice storms, tsunamis, floods, and widespread 
power outages. 

On 29 August 2005, Hurricane Katrina caused massive 
destruction in Louisiana and Mississippi in the Southeast 
US, and disrupted communications with 134 networks 
[82,83]. Many of these disruptions were due to power out- 
ages, and the majority of them were restored within a ten 
day period. One link of the Abilene Internet2 research net- 
work was also taken down, but sufficient capacity was 
available in alternate paths to reroute traffic. There was 
also significant disruption to the PSTN due to the increased 
traffic and destruction of cellular-telephony towers. The 
disaster-recovery efforts which occurred following the 
hurricane also highlighted the challenges resulting from 
incompatible communication equipment used by different 
first responders (local police, fire, state police, coast guard, 
national guard, etc.). 

On 26 December 2006, and continuing for two days, an 
earthquake with a number of major aftershocks took place 
near Hengchun, Taiwan, that damaged the submarine
cables that provide Internet connectivity between Asia 
and North America [84]. 1200 prefix ranges became temporarily
unreachable, China’s Internet access capacity was
reduced by 74%, and Hong Kong’s Internet access was 
completely disabled. While BGP was able to automatically 
reroute some of the traffic, it did so without any knowledge 
of the underlying physical topology or link utilisation, 
resulting in traffic between China and Taiwan crossing 
the Pacific Ocean twice. Manual traffic engineering was re- 
quired to reroute traffic via Korea and Japan, instead of via 
the US. 

Other events that could cause catastrophic power grid, 
radio-communication, and communication network fail- 
ures over a large area include radiation from solar-induced 
geomagnetic storms [85] (with a predicted peak in 2012) 
and attacks from electromagnetic pulse (EMP) weapons 
[86]. Finally, a serious pandemic could have severe impact 
if people are unable or afraid to operate and maintain crit- 
ical infrastructure including the Global Internet [87,88]; 
this was a concern that fortunately did not come to pass 
in the 2009-2010 H1N1 influenza pandemic, but may be 
inevitable in a future avian or swine flu pandemic. 


3.4. Malicious attacks


Malicious attacks come from cyber-criminals, for rea- 
sons including terrorism, information warfare among na- 
tions, political groups, or competing businesses, as well 
as from recreational crackers including script kiddies. 
These challenges may destroy or damage critical compo- 
nents in the network infrastructure with the intent of dis- 
rupting network services, or disable network infrastructure 
as collateral damage. 

The 9/11 terrorist attacks of 11 September 2001 on New
York were relatively localised and not targeted against the
network infrastructure per se, but many subscriber lines
were affected, and 1-2% of all Internet prefix blocks
were unreachable at the peak of the disruption [89]. Most
of these were restored within a day or two, and the impact 
to core Internet infrastructure services such as DNS and 
BGP was minimal. The more noticeable effects were due 
to flash crowds for news-related Web sites, which were 
quickly overloaded, seeing in some cases 350% of their nor- 
mal traffic load. DNS, however, saw only normal load due 
to caching at end systems, and the Internet itself saw lower 
than normal aggregate amounts of traffic, presumably 
because many people were preoccupied watching the tele- 
vised coverage and using the Web less than usual. The 7/7 
terrorist attacks against London Transport on 7 July 2005 
did not directly damage network infrastructure other than 
that used by the London Underground, but also induced 
significant traffic on the mobile telephone network and 
Internet as people tried to get news, and impaired first 
responder ability to communicate with one another [90]. 

Malicious attacks that exploit protocols or software vul- 
nerabilities include distributed denial-of-service (DDoS) 
campaigns that are frequently intended to harm an individual, 
organisation, corporation, or nation. When the Estonian 
government decided to move a Soviet war memorial, 
a DDoS attack was launched from IP addresses within Russia. 
The politically-motivated attacks targeted a variety of 
Estonian government and business Web sites. Estonia severed 
its Internet connection to the rest of the world to stop 
the effects of the attacks [91]. 


3.5. Environmental challenges 


Challenges to the communication environment include 
weak, asymmetric, and episodic connectivity of wireless 
channels; high-mobility of nodes and subnetworks; unpre- 
dictably long delay paths either due to length (e.g. satellite) 
or as a result of episodic connectivity. These challenges are 
addressed by disruption-tolerant networks (DTNs), as de- 
scribed in Section 2.2.3. 


3.6. Failures at a lower layer 


If any of these challenges causes a service failure at a 
particular layer, that failure becomes a challenge to any 
higher level service which depends on the correct behav- 
iour of the layer that fails. This class of failures may induce 
recursive failing of services until the error can be contained 
within a higher-level service. Therefore, symptoms of a 
challenge can often be seen by multiple services. For exam- 
ple, a fibre cut causes the physical lightpath service to fail, 
resulting in the failure of all link-layer connections that 
rely on that lightpath; however, remediation may occur 
at a higher layer by re-routing traffic across an alternative 
fibre. 

In January 2008, cable cuts in the Mediterranean Sea 
caused substantial disruptions in connectivity to Egypt 
and India, with lesser disruptions to Afghanistan, Bahrain, 
Bangladesh, Kuwait, Maldives, Pakistan, Qatar, Saudi Ara- 
bia, and the United Arab Emirates. More than 20 million 
Internet users were affected. The cause of the cuts is as- 
sumed to be unintentional, such as a boat anchor or natural 
wear and abrasion of the submarine cable against rocks on 
the sea floor [92-94]. 

Submarine cable cuts are a common occurrence, with 
an average of one cut occurring every three days world- 
wide. Most of these cuts occur as uncorrelated random 
events and go unnoticed by end users due to redundancy 
in the infrastructure. However in cases such as the Taiwan 
earthquake, and the Mediterranean cable cuts (as well as 
the Baltimore tunnel fire), multiple correlated link failures 
caused major outages, emphasising that redundancy for 
fault tolerance is not sufficient for resilience; geographic 
diversity for survivability is also needed. 


3.7. Summary of challenges and past failures 


While the Global Internet as a whole has proven to be 
relatively resilient to challenges discussed in this section 
due to its scope and geographic diversity, there is reason 
for concern. Some of these challenges have had a significant 
regional impact, and analyses of vulnerabilities indicate 
that a coordinated attack on the Internet 
infrastructure such as DNS root servers and IXPs (Internet 
exchange points such as MAE East, West, and Central) 
could have severe consequences. 


4. ResiliNets framework and strategy 


This section describes frameworks and strategies for 
network resilience. First, several important previous 
frameworks are reviewed that consider dependability, sur- 
vivability, and performability. Then, the comprehensive 
ResiliNets framework and strategy is presented, which 
draws heavily on these frameworks, as well as past work 
in the disciplines presented in Section 2. 


4.1. Previous strategies 


There have been several systematic resilience strate- 
gies, presented in the following subsections: ANSA, T1, 
CMU-CERT, and SUMOWIN. 


4.1.1. ANSA 

The Advanced Networked Systems Architecture (ANSA) 
project [95] covered a number of aspects of large system de- 
sign, including dependability. The dependability manage- 
ment strategy consists of eight stages (based on [96]): 
fault confinement, fault detection (properly error/failure 
detection), fault diagnosis, reconfiguration, recovery, restart, 
repair, and reintegration. The ANSA framework defines 
expectation regions in a two-dimensional value × time 
space to describe acceptable service, and thus considers 
performability. Service failures are a mismatch between an 
occurrence in this space and the expectation regions. 
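
The notion of an expectation region can be illustrated with a minimal sketch. The source does not give ANSA's formal definition, so the rectangular region shape, the name `ExpectationRegion`, and the numeric bounds used with it are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectationRegion:
    """Hypothetical axis-aligned region of acceptable service in value x time space."""
    value_min: float
    value_max: float
    time_min: float  # earliest acceptable delivery time (seconds)
    time_max: float  # latest acceptable delivery time (seconds)

    def contains(self, value: float, time: float) -> bool:
        """True if the occurrence (value, time) lies inside this region."""
        return (self.value_min <= value <= self.value_max
                and self.time_min <= time <= self.time_max)


def is_service_failure(value: float, time: float, regions) -> bool:
    """An occurrence that matches no expectation region is a service failure."""
    return not any(region.contains(value, time) for region in regions)
```

Under these assumptions a correct value delivered too late is a failure in the same sense as an incorrect value delivered on time, which is how the value × time space captures performability as well as correctness.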


4.1.2. T1 

The T1A1.2 Working Group of the Alliance for Telecommunications 
Industry Solutions (ATIS) on network survivability performance 
has developed a multilevel framework for 
network survivability [97] with four layers: physical, consisting 
of infrastructure with geographic diversity for survivability; 
system, consisting of nodes and links with 
protection switching for survivability; logical, consisting of 
capacity on the system layer; and service, consisting of voice 
and data circuits with dynamic routing and reconfiguration 
for survivability. This framework quantifies service outages 
as a (U, D, E) triple, in which U is the unservability (an inverse 
dependability metric such as unavailability or failure), D is 
the duration in time, and E is the extent over various parameters 
including geographic area, population, and services. 
Severity categories of this triple mapped into three-dimensional 
space are categorised as minor, major, or catastrophic. 
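
A minimal sketch of how a (U, D, E) triple might be mapped to a severity category follows. The threshold values, the choice of affected users as the extent parameter, and the rule that all three dimensions must be exceeded are hypothetical; the source does not state the boundaries used in [97].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    unservability: float  # U: fraction of service that is unavailable, 0..1
    duration_s: float     # D: duration of the outage in seconds
    extent: float         # E: here, number of affected users (an assumption)


# Hypothetical category boundaries, chosen only for illustration.
MAJOR = Outage(unservability=0.1, duration_s=300, extent=10_000)
CATASTROPHIC = Outage(unservability=0.5, duration_s=3600, extent=1_000_000)


def severity(outage: Outage) -> str:
    """Classify an outage by the region of (U, D, E) space it falls in."""
    def reaches(bound: Outage) -> bool:
        return (outage.unservability >= bound.unservability
                and outage.duration_s >= bound.duration_s
                and outage.extent >= bound.extent)

    if reaches(CATASTROPHIC):
        return "catastrophic"
    if reaches(MAJOR):
        return "major"
    return "minor"
```

With these assumed bounds, a total outage that affects millions of users but lasts only ten seconds is still classified as minor, which illustrates why the framework uses all three dimensions rather than unservability alone.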


4.1.3. CMU-CERT 

The CERT Coordination Center at CMU proposed a four-step 
strategy [25] consisting of the “three R’s”: (1) resistance 
(traditional security, diversity, redundancy, specialization, 
trust validation, and observed stochastic 
properties); (2) recognition (analytical redundancy and 
testing, intrusion monitoring, system behaviour, integrity 
monitoring); (3) recovery (redundancy, diverse location 
of information resources, contingency planning and re- 
sponse teams); followed by (4) adaptation and evolution. 


4.1.4. SUMOWIN 

The Survivable Mobile Wireless Networking (SUMO- 
WIN) project [26] explored mechanisms and strategies 
for disruption tolerance (before the term was generally 
adopted) in environments where stable end-to-end con- 
nectivity was not achievable. The strategy consists of three 
components: (1) Maintain connectivity when possible 
using techniques such as adaptive transmission power 
control and highly-dynamic MANET techniques (eventual 
stability); (2) Use new forwarding and routing techniques 
that do not require routing convergence to move data 
towards the destination, such as store-and-forward 
buffering and store-and-haul (store-carry-forward or 
ferrying), when stable end-to-end paths cannot be maintained 
(eventual connectivity); (3) Use innovative technologies such as 
satellite networks and active (adaptable programmable) 
networking to establish connectivity and maintain stealth. 


4.2. ResiliNets 


The ResiliNets initiative [98] has developed a framework 
for resilient networking [99], initially as part of the Auto- 
nomic Network Architecture (ANA) [100,101] and Post- 
modern Internet Architecture (PoMo) [102,103] projects, 
serving as the basis of the ResumeNet (Resilience and Sur- 
vivability for Future Networking: Framework, Mechanisms, 
and Experimental Evaluation) project [104,105]. This initia- 
tive was heavily influenced by the frameworks described 
above, and can be viewed as a successor and synthesis of 
all of them. The ResiliNets framework is described by a 
set of axioms and a strategy in the rest of this section, and 
by a set of design principles described in Section 5. As 
appropriate, cross references will be given by axiom (An), 
strategy (Sm), or principle (Pm) number. 


4.3. ResiliNets axioms 


Axioms provide the basis for any systematic frame- 
work; we present four basic self-evident tenets that form 
the basis for the ResiliNets strategy. 


A0. Faults are inevitable; it is not possible to construct 
perfect systems, nor is it possible to prevent challenges 
and threats. 

It is not possible to construct fault-free systems, for two 
reasons: First, internal faults are those that arise from 
within a given system due to imperfect designs, and 
while it is theoretically possible to use formal methods 
to design a provably correct system, this remains 
impractical for large complex systems and networks 
for the foreseeable future. Second, external faults are 
exercised by challenges from outside the system, and 
it is neither possible nor practical to predict all such 
challenges (present and future) and design defences 
against them. Threat and challenge models improve 
the ability to prevent external faults, but do not elimi- 
nate them. 

A1. Understanding normal operation is necessary, 
including the environment and application demands. It 
is only by understanding normal operation that we have 
any hope of determining when the network is challenged 
or threatened. 

We define normal operation to be the state of the net- 
work when there are no adverse conditions present. This 
loosely corresponds to the conditions for which the cur- 
rent Internet and PSTN are designed, when the network 
is not under attack, the vast majority of network infra- 
structure is operational, and connectivity is relatively 
strong. As an example, the public switched telephone 
network (PSTN) is designed to handle normal time-of- 
day fluctuations of traffic, and even peak loads such as 
Mother’s Day. These predictable application demands 
are within normal operation. On the other hand, PSTN 
call load during a disaster such as 9/11 or Hurricane Kat- 
rina, as well as flash crowds to an obscure Web site rep- 
resent traffic that is beyond normal operation. It is 
essential to understand normal operation to be able to 
detect when an adverse event or condition occurs (S2). 

A2. Expectation and preparation for adverse events 
and conditions is necessary, so that defences and detection 
of challenges that disrupt normal operations can 
occur. These challenges are inevitable. 

We define an adverse event or ongoing condition as challenging 
(Section 3) the normal operation of the network. 
We can further classify adverse events and 
conditions by severity as mild, moderate, or severe, 
and categorise them into two types: 

(1) Anticipated adverse events and conditions are ones 
that we can predict based either on past events 
(such as natural disasters) and attacks (e.g. viruses, 
worms, DDoS), or that a reasoned threat analysis 
would predict might occur. 

(2) Unanticipated adverse events and conditions are 
those that we cannot predict with any specificity, 
but for which we can still be prepared in a general 
sense. For example, there will be new classes of 
attacks for which we should be prepared. 

It is necessary to expect adverse events and conditions 
in order to design resilient networks, and thus this axiom 
motivates the defend and detect aspects of the resilience 
strategy (S1, S2). 

A3. Response to adverse events and conditions is 
required for resilience, by remediation ensuring correct 
operation and graceful degradation, restoration to normal 
operation, diagnosis of root cause faults, and refinement of 
future responses. 

While it is necessary to expect adverse events and conditions, 
it is just as important to take action when challenges 
do occur. This motivates the remediation aspect 
of the resilience strategy (S3). 


4.4. ResiliNets strategy 


The resilience axioms motivate a strategy for resilience, 
based in part on the previous strategies (Section 4.1), 
developed as part of the ResiliNets [98], ANA [100], and 
ResumeNet [104] projects. 


4.4.1. Introduction 
We begin with a motivating example and analogy of a 
mediaeval castle that is designed to be resilient in structure 
and operations. Castles generally contain both inner 
and outer thick walls, perhaps additionally protected by a 
moat or by being located on a mountain; this is a structural 
or passive defence. Additionally, guards are patrolling the 
perimeters and checking the credentials of visitors, an ac- 
tive defence. While the defences are intended to resist at- 
tack, they may be penetrated. An advancing army might 
be successful in launching a trebuchet projectile that blasts 
a hole in the castle wall. The guards detect this adverse 
event, and respond by sending subjects to the inner wall, 
and repel the advancing forces, perhaps by pouring boiling 
oil over the hole to prevent the enemy from entering; these 
are remediation actions. Once the enemy forces have been 
repelled, the outer wall must be repaired, recovering to 
the initial pre-attack state. 

It is also important to analyse what went wrong. A diag- 
nosis of why the projectile was able to penetrate the wall 
may indicate a design flaw in the wall materials or thick- 
ness. Furthermore, an analysis of the entire battle may re- 
fine a number of aspects of the entire process, indicating 
ways to improve detection (for example establishing re- 
mote outposts to observe enemy movement) and remedia- 
tion (improving fighting techniques). 

This example not only motivates the ResiliNets strategy 
presented next, but also indicates that many of these ideas 
are very old, at least in a general context predating their 
application to networks by several centuries. 

We formalise this as a two-phase strategy that we call 
D²R² + DR, as shown in Fig. 3. At the core are passive 
structural defences. The first active phase, D²R²: defend, 
detect, remediate, recover, is the inner control loop and de- 
scribes a set of activities that are undertaken in order for 
a system to rapidly adapt to challenges and attacks and 
maintain an acceptable level of service. The second active 
phase DR: diagnose, refine, is the outer loop that enables 
longer-term evolution of the system in order to enhance 
the approaches to the activities of phase one. The follow- 
ing sections describe the steps in this strategy. As appro- 
priate, principles in Section 5 will be referenced as Pn 
that are motivated by the strategy. All of these strategy 
steps require that the proper design tradeoffs be made 
(P6, P7, P8). 
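
The relationship between the two loops can be sketched in a few lines. This is a toy model under stated assumptions, not the ResiliNets implementation: the class `ToyResilientLink`, the use of a single load metric, and the rule that refinement raises capacity to the worst observed load are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResilientLink:
    capacity: float            # S1 defend: provisioned headroom (hypothetical)
    remediating: bool = False
    history: list = field(default_factory=list)

    def step(self, load: float) -> str:
        """Inner loop: one real-time pass over a single observation."""
        challenged = load > self.capacity       # S2 detect
        if challenged:
            self.remediating = True             # S3 remediate, e.g. shed load
            action = "remediate"
        elif self.remediating:
            self.remediating = False            # S4 recover to normal operation
            action = "recover"
        else:
            action = "normal"
        self.history.append((load, action))
        return action

    def diagnose(self) -> float:
        """S5: worst load the defences failed to absorb (0.0 if none)."""
        return max((load for load, action in self.history
                    if action == "remediate"), default=0.0)

    def refine(self) -> None:
        """S6: adjust the defence so the same challenge is absorbed next time."""
        self.capacity = max(self.capacity, self.diagnose())
        self.history.clear()
```

Calling `step` for every observation corresponds to the inner loop, while `diagnose` and `refine` would run in the background on the accumulated history, which mirrors the separation between real-time adaptation and longer-term evolution.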


Fig. 3. ResiliNets strategy. 


4.4.2. D²R² inner loop 

The first strategy phase consists of a passive core and a 
cycle of four steps that are performed in real time and are 
directly involved in network operation and service provi- 
sion. In fact, there is not just one of these cycles, but many 
operating simultaneously throughout the network for each 
resilient system, triggered whenever an adverse event or 
condition is detected. 


S1. Defend against challenges and threats to normal operation. 

The basis for a resilient network is a set of defences that 
reduce the probability of a fault leading to a failure 
(fault tolerance) and reduce the impact of an adverse 
event on network service delivery. These defences are 
identified by developing and analysing threat models 
(P3), and consist of passive and active components. 
Passive defences are primarily structural, suggesting 
the use of trust boundaries (P5), redundancy (P11) 
and diversity (P12). The main network techniques are 
to provide geographically diverse redundant paths and 
alternative technologies such as simultaneous wired 
and wireless links, so that a challenge to part of the net- 
work permits communication to be routed around the 
failure [106-108]. 

Active defences consist of self-protection (P9) mecha- 
nisms operating in the network that defend against 
challenges, such as firewalls that filter traffic for anom- 
alies and known attack signatures [109], and the even- 
tual connectivity paradigm (P10) that permits 
communication to occur even when stable end-to-end 
paths cannot be maintained. Clearly, defences will not 
always prevent challenges from penetrating the net- 
work, which leads to the next strategy step: detect. 
S2. Detect when an adverse event or condition has 
occurred. 

The second step is for the network, as a distributed system 
as well as individual components such as routers, 
to detect challenges and to understand when the 
defence mechanisms have failed. There are three main 
ways to determine if the network is challenged. The 
first of these involves understanding the service 
requirements (P1) and normal operational behaviour 
(A1, P2) of a system and detecting deviations from it: 
anomaly detection based on metrics (P4) [110]. The 
second approach involves detecting when errors occur 
in a system, for example, by calculating cyclic-redundancy 
checks (CRCs) to determine the existence of bit 
errors that could lead to a service failure. Finally, a system 
should detect service failures; an essential facet of 
this is an understanding of service requirements (P1). 
An important aspect of detecting a challenge is determining 
its nature, which requires context awareness 
(P14). For example, in an environmental monitoring 
system a sensor may be reporting anomalous readings; 
this could be as a result of the observed environment 
(e.g., there is a flood event) or device failure. Detection 
along with this understanding will inform an appropriate 
remediation strategy [111]. 

S3. Remediate the effects of the adverse event or condition. 

The next step is to remediate the effects of the detected 
adverse event or condition to minimise the effect on 
service delivery. The goal is to do the best possible at 
all levels after an adverse event and during an adverse 
condition. This requires adaptation (P17) and autonomic 
behaviour (P16) so that corrective action can 
be taken at all levels (P13, P15) without direct human 
intervention, to minimise the impact of service failure, 
including correct operation with graceful degradation 
of performance. 

A common example of remediation is for dynamic routing 
protocols to reroute around failures [112,32,45,113, 
114] and for adaptive applications and congestion control 
algorithms to degrade gracefully from acceptable to 
impaired service (Section 6). There may be a number of 
strategies that can be used to remediate against a given 
challenge; a key problem is determining the most 
appropriate one to take [115,116]. 

S4. Recover to original and normal operations. 

Once the challenge is over after an adverse event or the 
end of an adverse condition, the network may remain in 
a degraded state (Section 6). When the end of a challenge 
has been detected (e.g., a storm has passed, which 
restores wireless connectivity), the system must 
recover to its original optimal normal operation (P2), 
since the network is likely not to be in an ideal state, 
and continued remediation activities may incur an 
additional resource cost. However, this may not be 
straightforward. For example, it may not be clear when 
to revoke a remediation mechanism that is attributed to 
a particular challenge, as it may be addressing another 
problem. 


4.4.3. DR outer loop 


The second phase consists of two background operations 
that observe and modify the behaviour of the D²R² cycle: 
diagnosis of faults and refinement of future behaviour. 
While currently these activities generally have a significant 
human involvement, a future goal is for autonomic systems 
to automate diagnosis and refinement (P16). 


S5. Diagnose the fault that was the root cause. 

While it is not possible to directly detect faults 
(Section 2.1), we may be able to diagnose the fault that caused 
an observable error. In some cases this may be auto- 
mated, but more generally it is an offline process of 
root-cause analysis [10]. The goal is to either remove 
the fault (generally a design flaw as opposed to an 
intentional design compromise) or add redundancy for 
fault tolerance so that service failures are avoided in 
the future. An example of network-based fault diagno- 
sis is the analysis of packet traces to determine a proto- 
col vulnerability that can then be fixed. 

S6. Refine behaviour for the future based on past D²R² 
cycles. 

The final aspect of the strategy is to refine behaviour for 
the future based on past D²R² cycles. The goal is to learn 
and reflect on how the system has defended, detected, 
remediated, and recovered so that all of these can be 
improved to continuously increase the resilience of 
the network. 
This is an ongoing process that requires that the network 
infrastructure, protocols, and resilience mechanisms 
be evolvable (P17). This is a significant 
challenge given the current Internet hourglass waist 
[117] of IPv4, BGP, and DNS, as well as other mechanisms 
(e.g. NAT) and protocol architectures (e.g. TCP 
and HTTP) that are entrenched and resist innovation. 


5. ResiliNets design principles 


The past experience of the resilience disciplines, basis 
of the axioms, and synthesis of the D²R² + DR strategy 
leads to a set of principles for the design of resilient 
networks and systems. There is a careful balance to be 
struck in any system of design principles: a large enough 
number to provide specific guidance in the architecture 
and design of resilient networks, but a small enough 
number to be manageable without being overwhelming. 
The resilience principles are shown in Fig. 4, clustered 
in the major categories prerequisites, tradeoffs, enablers, 
and behaviour. 


5.1. Prerequisites 


Five principles span the domain of prerequisites neces- 
sary to build a resilient system. 


P1. Service requirements of applications need to be 
determined to understand the level of resilience the 
system should provide. In this sense, resilience may 
be regarded as an additional QoS property along with 
conventional properties such as performance. 

P2. Normal behaviour of the network is a combination 
of design and engineering specification, along with 
monitoring while unchallenged to learn the network’s 
normal operational parameters [118,119]. This is a fun- 
damental requirement (A1) for detecting (S2) chal- 
lenges [120,121]. 

P3. Threat and challenge models [122,123] are essen- 
tial to understanding and detecting potential adverse 
events and conditions. It is not possible to understand, 
define, and implement mechanisms for resilience that 
defend against, detect, and remediate (S1-3) challenges 
without such a model. 

P4. Metrics quantifying the service requirements and 
operational state are needed to measure the operational 
state (in the range from normal through partially-degraded 
to severely-degraded) and service state (in the range 
from acceptable through impaired to unacceptable) to detect and 
remediate (S2-3) and quantify resilience to refine future 
behaviour (S6). This is discussed further in Section 6. 

P5. Heterogeneity in mechanism, trust, and policy are 
the realities of the current world. No single technology 
is appropriate for all scenarios, and choices change as 
time progresses. Therefore it is increasingly unrealistic 
to consider the set of global networks as a homoge- 
neous internetwork. The Global Internet is a collection 
of realms [124-126] of disparate technologies [102]. 
Furthermore, realms are defined by trust and policy, 
across which there is tussle [127]. Resilience mecha- 
nisms must deal with heterogeneous link technologies, 
addressing, forwarding, routing, signalling, traffic, and 
resource management mechanisms. Resilience mecha- 
nisms must also explicitly admit trust and policy tus- 
sles. These realms can also serve to enhance resilience 
by giving self-protection (P9) boundaries in which to 
operate. 
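
The interplay of normal behaviour (P2) and metrics (P4) in detection can be sketched as follows. This is a minimal illustration under assumptions: the function names, the use of mean and standard deviation as the learned baseline, and the thresholds of 3 and 6 standard deviations are hypothetical and are not taken from the cited work.

```python
import statistics


def learn_baseline(samples):
    """P2: learn normal behaviour from measurements taken while unchallenged."""
    return statistics.mean(samples), statistics.stdev(samples)


def operational_state(value, baseline, degraded_k=3.0, severe_k=6.0):
    """P4: map a metric to an operational state by its deviation from normal.

    degraded_k and severe_k are hypothetical thresholds, in standard deviations.
    """
    mean, stdev = baseline
    stdev = max(stdev, 1e-9)  # guard against a perfectly constant baseline
    deviation = abs(value - mean) / stdev
    if deviation >= severe_k:
        return "severely-degraded"
    if deviation >= degraded_k:
        return "partially-degraded"
    return "normal"
```

For example, a baseline learned from round-trip delays observed during unchallenged operation could be passed to `operational_state` for each new measurement; a production detector would need a more robust model of normal behaviour than a single mean and deviation.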


5.2. Design tradeoffs 


Three principles describe fundamental tradeoffs that 
must be made while developing a resilient system. 


P6. Resource tradeoffs determine the deployment of 
resilience mechanisms. The relative composition and 
placement of these resources must be balanced to opti- 
mise resilience and cost. The maximum availability of a 
particular resource serves as a constraint in these opti- 
mizations. Resources to be traded against one another 
include bandwidth, memory [128], processing, latency 
[129], energy, and monetary cost. Of particular note is 
that maximum resilience can be obtained with unlim- 
ited cost, but there are cost constraints that limit the 
use of enablers such as redundancy and diversity 
(P11-P12). 

P7. Complexity of the network results from the interaction 
of systems at multiple levels of hardware and 
software, and is related to scalability. While many of 
the resilience principles and mechanisms increase this 
complexity, complexity itself makes systems difficult 
to understand and manage, and thereby threatens resil- 
ience. The degree of complexity [130-132] must be 
carefully balanced in terms of cost vs. benefit, and 
unnecessary complexity should be eliminated. 

Fig. 4. Resilience principles. 

P8. State management is an essential part of any large 
complex system. It is related to resilience in two ways: 
First, the choice of state management impacts the resil- 
ience of the network. Second, resilience mechanisms 
themselves require state and it is important that they 
achieve their goal in increasing overall resilience by 
the way in which they manage state, requiring a trade- 
off among the design choices. Resilience tends to favour 
soft, distributed, inconsistency-tolerant state rather 
than hard, centralised, consistent state, but careful 
choices must be made in every case. 
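
The preference for soft state expressed in P8 can be made concrete with a short sketch. The class `SoftStateTable` and its interface are hypothetical; the point is only that an entry disappears unless it is periodically refreshed, so stale state left behind by a failed neighbour is removed without explicit teardown signalling.

```python
class SoftStateTable:
    """Hypothetical table whose entries expire unless refreshed within `lifetime` seconds."""

    def __init__(self, lifetime: float):
        self.lifetime = lifetime
        self._entries = {}  # key -> (value, expiry time)

    def refresh(self, key, value, now: float) -> None:
        """Install or renew an entry; the caller supplies the current time."""
        self._entries[key] = (value, now + self.lifetime)

    def lookup(self, key, now: float):
        """Return the value if it is still fresh, otherwise drop it and return None."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires = entry
        if now >= expires:
            del self._entries[key]
            return None
        return value
```

The tradeoff named in P8 is visible here: a shorter lifetime removes stale state faster after a failure but costs more refresh traffic (P6), whereas hard state avoids the refresh cost but requires reliable, explicit removal.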


5.3. Enablers 


Seven principles are enablers of resilience that guide 
network design and engineering. 


P9. Self-protection and security are essential proper- 
ties of entities to defend against challenges (A2) in a 
resilient network. Self-protection is implemented by a 
number of mechanisms, including but not limited to 
mutual suspicion, the AAA mechanisms of authentica- 
tion, authorisation, and accounting, as well as the 
additional conventional security mechanisms of confi- 
dentiality, integrity, and nonrepudiation. 

P10. Connectivity and association among communi- 
cating entities should be maintained when possible 
based on eventual stability, but information flow should 
still take place even when a stable end-to-end path does 
not exist based on the eventual connectivity model 
[133,26]; the use of disruption-tolerant networking 
(DTN) techniques such as partial paths [134], store- 
and-forward with custody transfer [35,36], and store- 
and-haul [26,37,38] permit this. 

P11. Redundancy in space, time, and information 
increases resilience against faults and some challenges 
if defences (S1) are penetrated. Redundancy refers to 
the replication of entities in the network, generally to 
provide fault tolerance. In the case that a fault is acti- 
vated and results in an error, redundant components 
are able to operate and prevent a service failure. Spatial 
redundancy examples are triple-modular redundant 
hardware [19] and parallel links and network paths. 
Examples of temporal redundancy are erasure coding 
[135] consisting of repeated transmission of packets, 
periodic state synchronisation, and periodic informa- 
tion transfer (e.g. digital fountain [136]). Information 
redundancy is the transmission or storage of redundant 
information, such as forward error correction (FEC). It is 
important to note that redundancy does not inherently 
prevent the redundant components from sharing the 
same fate. 
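
Two of the redundancy types named in P11 can be illustrated with a minimal sketch: a majority voter for triple-modular redundancy (spatial) and a single XOR parity block as a very simple form of information redundancy. The function names are assumptions, and the parity scheme is chosen for brevity; it is not the coding used in the cited references.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority output of three replicas; raise if all disagree."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


def xor_parity(blocks):
    """Compute a parity block over equal-length byte blocks."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, block))
    return parity


def recover_missing(received, parity):
    """Rebuild the single missing block from the surviving blocks and the parity."""
    return xor_parity(list(received) + [parity])
```

Both mechanisms mask a single fault only. If all three replicas run the same faulty implementation, or all blocks traverse the same failed link, the redundancy is defeated, which is the fate-sharing caveat with which P11 closes.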

P12. Diversity is closely related to redundancy, but 
has the key goal to avoid fate sharing. Diversity in 
space, time, medium, and mechanism increases resil- 
ience against challenges to particular choices [137- 
139,24]. Diversity consists of providing alternatives 
so that even when challenges impact particular alter- 
natives, other alternatives prevent degradation from 
normal operations. Diverse alternatives can either be 
simultaneously operational, in which case they defend 
(S1) against challenges [140], or they may be avail- 
able for use as needed to remediate (S3) [141]. Spatial 
diversity requires redundancy of a degree at least 
equal to the degree of diversity, and can be catego- 
rised as topological diversity across the (logical) topol- 
ogy of the network [142,24], and geographic diversity 
across the physical topology of the network [45, 
106]; note that topologically diverse links or nodes 
may be physically co-located. Temporal diversity is 
intentional variance in the temporal behaviour of a 
component or protocol, such as variation in timing 
of protocol state transitions to resist traffic analysis. 
Operational diversity refers to alternatives in the archi- 
tecture and implementation of network components 
and protocols, that is the avoidance of monocultures, 
and may be categorised as: implementation diversity 
that prohibits systems from exhibiting the same error 
caused by an implementation fault (e.g. N-version 
programming [20]), by deploying systems software 
from multiple vendors and routers from different 
hardware vendors; medium diversity that provides 
choices among alternative physical media through 
which information can flow, such as wired and wire- 
less links; and mechanism diversity that consists of 
providing alternative mechanisms, such as FEC and 
ARQ-based error control. 

Fig. 5 shows an example of several kinds of diversity. 
Communicating subscribers are multihomed to service 
providers that are diverse in both geography and 
mechanism. Protection against a fibre cut is provided 
by the wireless access network; protection against 
wireless disruptions such as weather or jamming is 
provided by the fibre connection. 
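
Whether two redundant paths are actually diverse can be checked by comparing the risks they share. The following sketch assumes a hypothetical mapping from each link to the set of risk groups it traverses (for example a conduit, tunnel, or landing station); the function name and data layout are illustrative only.

```python
def shared_risks(path_a, path_b, risk_groups):
    """Return the risk groups common to two paths.

    path_a, path_b: iterables of link identifiers.
    risk_groups: dict mapping each link to the set of risk groups it traverses.
    """
    risks_a = set().union(*(risk_groups[link] for link in path_a))
    risks_b = set().union(*(risk_groups[link] for link in path_b))
    return risks_a & risks_b
```

An empty result indicates that no single listed challenge affects both paths. Two fibres routed through the same tunnel would return that tunnel as a shared risk despite being topologically distinct, whereas the fibre and wireless pair described above would not.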

Fig. 5. Diversity example. 

important that level boundaries be translucent in which 

Table 1 
Levels and selected resilience mechanisms. 

Level              Mechanism 
Application        Adaptive applications 
Transport          Eventual connectivity, erasure codes 
Internetworking    Heterogeneity, realm diversity 
Path               Multipath spreading, medium diversity 
Topology           k-connected, geographic graph diversity 
Links and nodes    Link error control, fault tolerance 
Physical channel   Robust coding 





P13. Multilevel resilience is defined with respect to 
protocol layer, protocol plane, and hierarchical network 
organisation [97,143,144]. Multilevel resilience is 
needed in three orthogonal dimensions: Protocol layers 
in which resilience at each layer provides a foundation 
for the next layer above; planes: data, control, and man- 
agement; and network architecture inside-out from fault 
tolerant components, through survivable subnetwork 
and network topologies, to the global internetwork 
including attached end systems. 

A given level is designed to be as resilient as is practica- 
ble given cost constraints (P6); this provides a founda- 
tion for the next level, as shown in Table 1. At the 
physical layer, robust coding optimises the probability 
of bits arriving uncorrupted to their next hop, but this 
will not be perfect in hostile environments. Therefore, 
at the link layer, we apply error control to provide as 
reliable a link as practical, but this will not be perfect, 
and links may be cut or intermittently connected. We 
ensure that geographically and technologically diverse 
links and nodes are available (P12). This permits us to 
construct network topologies that have not only redun- 
dancy but geographic diversity so that link cuts do not 
partition the network. This in turn provides the foundation 
for the network layer to construct paths even 
when links are down and part of the topology is not stable, 
by using techniques such as multipath and fast-reroute. 
A heterogeneous (P5) set of subnetworks forms 
the Global Internet, over which a resilient end-to-end 
transport layer uses principles such as eventual connec- 
tivity (P10) and erasure coding to transfer data end-to- 
end, even when end-to-end stable paths cannot be 
maintained. Finally, adaptive applications should be 
tolerant of poor end-to-end transport. 

P14. Context awareness is needed for resilient nodes to 
monitor the network environment (channel conditions, 
link state, operational state of network compo- 
nents, etc.) and detect adverse events or conditions 
[145,146,45]. Remediation (S3) mechanisms must take 
the current context of system operation into account. 
P15. Translucency [147,129] is needed to control the 
degree of abstraction vs. the visibility between levels. 
Complex systems are structured into multiple levels 
to abstract complexity and separate concerns. In the 
case of networks this consists of three multilevel 
dimensions: layer, plane, and system organisation. 
While this abstraction is important, an opaque level 
boundary can hide too much and result in suboptimal 
and improper behaviour based on incorrect implicit 
assumptions about the adjacent level [46]. Thus it is 
important that level boundaries be translucent, in which 
cross-layer control loops allow selected state to be 
explicitly visible across levels; dials expose state and 
behaviour from below; knobs influence behaviour from 
above [102]. 


5.4. Behaviour needed for resilience 


The last group of three principles encompasses the behaviours 
and properties a resilient system should possess. 


P16. Self-organising and autonomic behaviour 
[148,149,101] is necessary for network resilience that 
is highly reactive with minimal human intervention. A 
resilient network must initialise and operate itself with 
minimal human configuration, management, and inter- 
vention. Ideally human intervention should be limited 
to that desired based on high-level operational policy. 
The phases of autonomic networking consist of: (1) ini- 
tialisation consisting of auto-configuration of network 
components and their self-organisation into a network 
[150,69]; (2) steady-state normal operation consisting 
of self-managing with minimal human interaction dictated 
by policy and self-optimising to dynamic network 
conditions; and (3) steady-state expecting faults and 
challenges consisting of self-diagnosing [151] and self-repair. 

P17. Adaptability to the network environment is essen- 
tial for a node in a resilient network to detect, remedi- 
ate, and recover from challenges. Resilient network 
components need to adapt their behaviour based on 
dynamic network conditions, in particular to remediate 
(S3) from adverse events or conditions, as well as to 
recover (S4) to normal operations. At the network level, 
programmable and active network techniques enable 
adaptability [152,153]. 

P18. Evolvability [154] is needed to refine (S6) future 
behaviour to improve the response to challenges, as 
well as for the network architecture and protocols to 
respond to emerging threats and application demands. 
Refinement of future behaviour is based on reflection 
on the inner strategy loop: the defence against, detec- 
tion, and remediation of adverse events or conditions 
and recovery to normal operation (S1-4). Furthermore, 
it is essential that the system can cope with the evolu- 
tion and extension of the network architecture and pro- 
tocols over time, in response to long term changes in 
user and application service requirements, including 
new and emerging applications, technology trends, as 
resource tradeoffs change, and as attack strategies and 
threat models evolve. 


6. Resilience analysis 


In this section, we address the fundamental question of 
how to measure and quantify network resilience. To develop 
a comprehensive understanding of network resilience, 
methodologies are needed to measure the resilience (or 
lack thereof) of a given network and evaluate the benefit 
of proposed architectures, principles, and mechanisms. 
Traditionally, both resilience mechanisms and measures 
have been domain specific as well as challenge specific. For 
example, existing research on fault tolerance measures 
such as reliability and availability targets single instances 
of random faults, such as topology based survivability 
analysis, considering node and link failures [155-157]. 
More recently, generic survivability frameworks consider 
network dynamics in addition to infrastructure failures 
[29,158,159]. Survivability can be quantified based on 
availability and network performance models [160- 
162,30,163] using the T1A1.2 working group definition of 
survivability [52,13]. Resilience can be quantified as the 
transient performance and availability measure of the net- 
work when subjected to challenges outside of the design 
envelope [164]. Service oriented network measures in- 
clude user lost erlangs (measuring the traffic capacity lost 
during an outage) and unservability [165,54]. Based on 
the common distinction in the industry between equip- 
ment vendor, service provider, and end user, specific met- 
rics have been developed for each domain. In the field of 
network security, the commonly taken approach is to per- 
form a vulnerability analysis [166,167,55] in order to 
determine how a network responds to security risks. Resil- 
ience evaluation is more difficult than evaluating networks 
in terms of traditional security metrics, due to the need to 
evaluate the ability of the network to continue providing 
an acceptable level of service, while withstanding chal- 
lenges as defined in Section 2.3.1 [167,168]. 

Evaluating network resilience in this way effectively 
quantifies it as a measure of service degradation in the 
presence of challenges (perturbations) to the operational 
state of the network. Hence, the network can be viewed 
(at any layer) as consisting of two orthogonal dimensions 
as shown in Fig. 6: one is the operational state of the network, 
which consists of its physical infrastructure and 
protocols; the second dimension is the services being provided 
by the network and their requirements [168]. 

Fig. 6. Resilience state space. [Horizontal axis: operational state N (normal operation, partially degraded, severely degraded). Vertical axis: service parameters P (acceptable, impaired, unacceptable).] 

Formally, let resilience be defined at the boundary 
$B_{ij}$ between any two adjacent layers $L_i$, $L_j$. In order to 
characterise the state of the network below the boundary 
$B_{ij}$, let there be a set of $k$ operational metrics 
$N = \{N_1, N_2, \ldots, N_k\}$. To characterise the service from 
layer j to layer i, let there be a set of $l$ service parameters 
$P = \{P_1, P_2, \ldots, P_l\}$. Note that both of these dimensions 
are multi-variate and there is a clear mapping between 
the operational metrics and service parameters. Simply 
put, for a given set of operational conditions, the network 
provides a certain level of service for a given application. 
Thus, the state of a network is an aggregate of several 
points in this two-dimensional space. 

In order to limit the number of states, the operational 
and service space of the network may be divided into three 
regions each as shown in Fig. 6. The network operational 
space is divided into normal, partially-degraded, and se- 
verely-degraded regions. Similarly, the service space is di- 
vided into acceptable, impaired, and unacceptable regions. 
While an arbitrary number of such regions is possible, 
one of the primary goals of this work is to achieve tractable 
yet useful solutions, and this set of nine (3 × 3) regions 
provides the necessary abstraction while limiting the total 
number of regions. Each region may contain multiple states 
if the service demands such a fine granularity. In the limit- 
ing case, each region represents just one state. 
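
The division into nine regions can be sketched in Python as follows. This is an illustration only: it assumes a single operational metric (number of failed links) and a single service parameter (end-to-end delay in ms), and the region boundaries are hypothetical values that would in practice be set by the application and the expected service.

```python
from bisect import bisect_right

OPERATIONAL_REGIONS = ("normal", "partially degraded", "severely degraded")
SERVICE_REGIONS = ("acceptable", "impaired", "unacceptable")

def region(value, boundaries, labels):
    """Map a scalar onto one of len(boundaries) + 1 ordered regions."""
    return labels[bisect_right(boundaries, value)]

def classify(n, p, n_bounds=(2, 5), p_bounds=(100.0, 400.0)):
    """Return the (operational, service) region pair for one network state.

    n: failed links (operational metric); p: delay in ms (service parameter).
    Both sets of boundaries are assumed values for illustration.
    """
    return (region(n, n_bounds, OPERATIONAL_REGIONS),
            region(p, p_bounds, SERVICE_REGIONS))
```

For example, `classify(3, 250.0)` places the network in the partially degraded operational region and the impaired service region.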


6.1. State transitions and resilience evaluation 


Applying the fault → error → failure chain (Section 2.1), 
adverse events are manifest as degradations in the operational 
condition of the network. Hence, when such events 
degrade the operational state of the network, the level of 
service being provided degrades as well, resulting in state 
transitions, e.g., $S_0 \rightarrow S_1$ and $S_0 \rightarrow S_2$. The boundaries of 
each state and the number of states are determined by 
the application being supported as well as the expected 
service. Resilience $R_{ij}$ at the boundary $B_{ij}$ is then evaluated 
as the transition of the network through this state space. 
The goal is to derive $R_{ij}$ as a function of $N$ and $P$. In 
the simplest case $R_{ij}$ is the slope of the curve obtained by 
plotting $P$ vs. $N$ on a multi-variate piecewise axis. For 
example, when comparing two services over a given network, 
the service with a smaller slope ($S_0 \rightarrow S_1$) is considered 
more resilient as shown in Fig. 6. There are two 
potential issues with this approach: (1) The number of 
metrics in each dimension may be very large and (2) there 
may be state explosion due to the number of quantifiable 
network states. Fortunately, in this context, both problems 
can be avoided. First, the number of metrics can be limited 
to those that affect the service under consideration. Secondly, 
the number of states is limited by the granularity 
of the service differentiation required as illustrated below. 
Finally, all transients in network operation (originating 
from network dynamics) that result in a predetermined 
service level are aggregated into one state. This approach 
significantly reduces the number of states. 
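
The simplest case, in which resilience is read off as the slope of P against N, can be sketched as below. The sketch assumes one operational metric and one service parameter, each normalised so that 0 is the normal or acceptable value and larger values are worse; the state coordinates and service names are hypothetical.

```python
def slope(s_from, s_to):
    """Slope of the straight line between two (n, p) states."""
    (n0, p0), (n1, p1) = s_from, s_to
    if n1 == n0:
        raise ValueError("operational state did not change")
    return (p1 - p0) / (n1 - n0)

def more_resilient(trajectories):
    """Name of the trajectory with the smallest slope."""
    return min(trajectories, key=lambda name: slope(*trajectories[name]))

# Hypothetical normalised states: (operational degradation, service degradation).
S0 = (0.0, 0.0)
TRAJECTORIES = {
    "service_a": (S0, (0.5, 0.125)),  # same challenge, little service loss
    "service_b": (S0, (0.5, 0.75)),   # same challenge, large service loss
}
```

Both services face the same operational degradation here, so the one whose service parameter degrades less has the smaller slope and is selected by `more_resilient`.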

A network’s resilience is then evaluated based on the 
range of operational conditions for which it stays in the 
acceptable service region (irrespective of the specific 
state), and to declare a network resilient implies that it 
stays in the acceptable service region across a wide range 
of operational conditions. An ideal network would be one 
that stays in the acceptable service region (all its states distributed 
within the acceptable service region) for all network 
conditions in the presence of any number of 
challenges, but given that ideal systems are impractical, 
the objective is to design networks that stay in the accept- 
able state as long as possible and degrade gracefully in the 
face of increasingly severe challenges. 
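
One way to make "the range of operational conditions for which it stays in the acceptable service region" concrete is sketched below. It assumes a scalar operational metric n and a scalar service parameter p, both increasing as conditions worsen, and a hypothetical acceptability threshold; the sample values are invented for illustration.

```python
def acceptable_range(samples, p_acceptable):
    """Largest degradation n up to which every sample remains acceptable.

    samples: iterable of (n, p) pairs; larger n means a more degraded
    operational state and larger p means worse service (assumed convention).
    Returns None if even the least degraded sample is unacceptable.
    """
    limit = None
    for n, p in sorted(samples):
        if p > p_acceptable:
            break  # service has left the acceptable region
        limit = n
    return limit

# Hypothetical measurements: (failed links, delay in ms).
network_a = [(0, 10), (1, 12), (2, 15), (3, 40), (4, 90)]
network_b = [(0, 10), (1, 30), (2, 15)]
```

With a threshold of 20 ms, `network_a` stays acceptable up to two failed links while `network_b` leaves the acceptable region after the first failure, so under this measure `network_a` would be judged the more resilient of the two.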

Fig. 7 shows the relationship of the D²R² + DR strategy 
(Section 4) to the 2-dimensional state space. The left part 
shows the fault → error → failure chain (Fig. 1) with defence 
resisting the activation of faults and propagation of 
errors to service failures, as previously described. Challenges 
and errors can be detected to initiate remediation. 
The right side of the figure shows the operational dimension 
$N$ and service dimension $P$ as ranges that are degraded 
when errors propagate to operational state and 
when the effects of the operational state impact the service 
state ($S_0 \rightarrow S_2$ trajectory in Fig. 6). Remediation mechanisms 
help drive the operational state towards improvement 
($S_2 \rightarrow S_3$ trajectory), and service resilience resists 
the effects of degraded operational state from impairing 
the service ($S_0 \rightarrow S_1$ trajectory). Recovery moves the operational 
state back to normal operation at the end of the inner 
strategy loop ($S_3 \rightarrow S_0$ trajectory). Diagnosis and 
refinement are the outer loop, used to improve this entire 
process. 

Fig. 7. ResiliNets strategy block diagram. 

Fig. 8. Simulated area-based challenges for GEANT2 topology. 


6.2. Resilience analysis scenario 


To perform this evaluation, challenges are applied to 
the network that induce failures in links, nodes, or proto- 
cols. As an example consider the application of the state 
space to the link/topology boundary in Table 1. The hori- 
zontal N axis represents metrics that describe the opera- 
tional state of the network in terms of link failures 
(below the link/topology service line). Zero failures (or a 
small number) fall within normal operation, and some 
larger number of failures is defined as being severely degraded. 
Challenges to the network that cause links to fail 
then cause the state to move from normal operation $S_0$ 
to the right. Fig. 8 shows an example network to which 
we might apply such a challenge: the GEANT2 research 
network. 

The vertical P axis is the service delivered above the 
link/topology line, which is a connected diverse topology, 
and this is measured by metrics such as number of parti- 
tions, k-connectedness, and total graph diversity [106]. 
Consider the example of an area-based challenge [27,28] 
shown by the squares in Fig. 8, which might be the result 
of a major power failure or storm. As the impacted area increases, 
the topology goes from acceptable $S_0$ upwards, 
resulting in an impaired topology measured as loss of 
diversity (the smallest square), to an unacceptable topology 
that is partitioned (the largest square). This particular 
example would follow a trajectory $S_0 \rightarrow S_2$ in Fig. 6; however, 
if more links existed (for example from the UK to Iceland 
and Moscow to Turkey), a trajectory closer to $S_0 \rightarrow S_1$ 
would result since the network would not partition. This 
analysis can continue up the levels, and be performed 
using a combination of analysis and simulation, particu- 
larly at the higher levels to determine the effects on end- 
to-end protocols and applications. 
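
The area-based scenario can be approximated in simulation along the following lines. This is a sketch under stated assumptions: the five-node topology and its coordinates are hypothetical (it is not GEANT2), the challenge is modelled as a square that removes every node inside it, and the only service metric computed is the number of partitions.

```python
from collections import deque

# Hypothetical toy topology (NOT GEANT2): node -> (x, y), undirected links.
NODES = {"a": (0, 0), "b": (2, 0), "c": (4, 0), "d": (2, 2), "e": (6, 0)}
LINKS = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]

def apply_area_challenge(nodes, links, centre, half_width):
    """Remove every node inside the square and every link touching one."""
    cx, cy = centre
    alive = {v for v, (x, y) in nodes.items()
             if abs(x - cx) > half_width or abs(y - cy) > half_width}
    return alive, [(u, v) for u, v in links if u in alive and v in alive]

def count_partitions(alive, links):
    """Number of connected components among the surviving nodes (BFS)."""
    adj = {v: set() for v in alive}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), 0
    for start in alive:
        if start in seen:
            continue
        parts += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            for w in adj[queue.popleft()]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return parts

def challenge(half_width, centre=(2, 0.5)):
    """Surviving node count and partition count for one square size."""
    alive, links = apply_area_challenge(NODES, LINKS, centre, half_width)
    return len(alive), count_partitions(alive, links)
```

In this toy case the smallest square removes no nodes, a medium square removes node b so that a and c lose one of their two paths but remain connected, and the largest square also removes d and partitions the topology.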


7. Summary and research directions 


The Internet has become essential to all aspects of mod- 
ern life, and thus the consequences of network disruption 
have become increasingly severe. It is widely recognised 
that the Internet is not sufficiently resilient, survivable, 
or dependable, and that significant research, development, 
and engineering are necessary to improve the resilience of 
its infrastructure and services. Substantial research has 
previously gone into the disciplines that are part of resil- 
ience, and it is now possible to relate these to one another 
and derive an architectural framework to improve the 
resilience of existing networks and to guide the inclusion of 
resilience as a first-class design property in the evolving 
and future Internet. 

This paper has presented a systematic architectural 
framework that unifies resilience disciplines, strategies, 
principles, and analysis. The D²R² + DR (defend, detect, 
remediate, recover + diagnose, refine) ResiliNets strategy 
leads to a set of design principles that guide the analysis 
and design of resilient networks. The principles encompass 
prerequisites, tradeoffs, enablers, and behaviours for resil- 
ient network architecture and design. We believe this to be 
the most comprehensive resilience framework to date, 
which builds upon and unifies previous frameworks such 
as ANSA, T1, CMU-CERT, and SUMOWIN. 

While most of the technical areas covered by the Resili- 
Nets framework are the subject of intense investigation by 
a large number of researchers, some are more mature than 
others. We believe that significantly more work needs to 
be done in the areas of structural defences and remediation 
mechanisms, understanding and defining resilience met- 
rics, and the refinement aspects of the outer control loop. 
Future Internet research and infrastructure programmes 
such as FIND [169], FIRE [170], and GENI [171], and pro- 
jects within these programmes such as PoMo [102,103], 
ResumeNet [104,105], and GpENI [172,173] provide a fer- 
tile context in which to perform this research and 
experimentation. 